First, let’s load the R packages necessary for our project.
Access the data from PhysioNet from the WIDS Datathon Project Page. Download ‘training_v2.csv’ from the “Files” section of the project.
Now we can load in our data and start to work with it.
We will include four predictors in our model: age, bmi, gender, and glucose apache. Feel free to include more or choose entirely different predictors in your model!
# Select subset of variables from original data
reduced_data <- gossis %>% select(bmi,
age,
gender,
glucose_apache,
diabetes_mellitus, h1_glucose_max, d1_glucose_max,ethnicity,wbc_apache)
head(reduced_data) %>% kable()
| bmi | age | gender | glucose_apache | diabetes_mellitus | h1_glucose_max | d1_glucose_max | ethnicity | wbc_apache |
|---|---|---|---|---|---|---|---|---|
| 22.73 | 68 | M | 168 | 1 | NA | 168 | Caucasian | 14.1 |
| 27.42 | 77 | F | 145 | 1 | 145 | 145 | Caucasian | 12.7 |
| 31.95 | 25 | F | NA | 0 | NA | NA | Caucasian | NA |
| 22.64 | 81 | F | 185 | 0 | NA | 185 | Caucasian | 8.0 |
| NA | 19 | M | NA | 0 | NA | NA | Caucasian | NA |
| 27.56 | 67 | M | 156 | 1 | NA | 156 | Caucasian | 10.9 |
We will also define our outcome variable and drop data with unknown outcomes
# Set the outcome variable here
reduced_data$outcome_variable <- as.factor(reduced_data$diabetes_mellitus)
reduced_data <- subset(reduced_data, select = -c(diabetes_mellitus))
# Check number of rows
nrow(reduced_data)
## [1] 91713
# For simplicity, we will drop all rows with missing outcomes
reduced_data <- drop_na(reduced_data, any_of("outcome_variable"))
# Check number of rows after removing rows with missing outcomes
nrow(reduced_data)
## [1] 90998
We’ll encode our categorical variables. Encoding is the process of reshaping and binarizing categorical data to better suit machine learning models.
# Encode gender variable: male = 1, non-male = 0
reduced_data$gender <- ifelse(reduced_data$gender == "M", 1, 0)
Let’s create the training and test datasets.
# Set the random number seed for reproducibility
set.seed(1)
# Create data partition using the outcome variable
train_index <- createDataPartition(reduced_data$outcome_variable,
times = 1, p = 0.8, list = FALSE)
# Split data into train and test sets, select columns that will be used in model
train_set <- reduced_data[train_index, ]
head(train_set)
## # A tibble: 6 × 9
## bmi age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 22.7 68 1 168 NA 168 Caucasian
## 2 32.0 25 0 NA NA NA Caucasian
## 3 22.6 81 0 185 NA 185 Caucasian
## 4 57.4 59 0 197 197 197 Caucasian
## 5 NA 70 1 164 NA 129 Caucasian
## 6 NA 45 1 380 365 365 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
test_set <- reduced_data[-train_index, ]
head(test_set)
## # A tibble: 6 × 9
## bmi age gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 27.4 77 0 145 145 145 Caucasian
## 2 NA 19 1 NA NA NA Caucasian
## 3 27.6 67 1 156 NA 156 Caucasian
## 4 25.7 50 1 134 134 134 <NA>
## 5 38.2 81 1 120 121 121 Caucasian
## 6 23.4 79 0 175 NA 175 Caucasian
## # … with 2 more variables: wbc_apache <dbl>, outcome_variable <fct>
library(tableone) makes it easy to create these summary statisticsallvars <- c("bmi", "age", "gender", "glucose_apache", "h1_glucose_max", "d1_glucose_max","ethnicity", "wbc_apache")
catvars <- c("gender")
table_gossis <- CreateTableOne(vars = allvars, data = train_set,
factorVars = catvars,
strata = "outcome_variable")
kableone(table_gossis, caption = "Summary statistics of the training set")
| 0 | 1 | p | test | |
|---|---|---|---|---|
| n | 56405 | 16394 | ||
| bmi (mean (SD)) | 28.41 (7.86) | 31.82 (9.00) | <0.001 | |
| age (mean (SD)) | 61.57 (17.41) | 64.62 (14.45) | <0.001 | |
| gender = 1 (%) | 30270 (53.7) | 9044 (55.2) | 0.001 | |
| glucose_apache (mean (SD)) | 141.36 (73.15) | 218.80 (112.45) | <0.001 | |
| h1_glucose_max (mean (SD)) | 147.92 (73.63) | 214.95 (118.80) | <0.001 | |
| d1_glucose_max (mean (SD)) | 154.69 (68.89) | 239.15 (104.81) | <0.001 | |
| ethnicity (%) | <0.001 | |||
| African American | 5620 (10.1) | 1977 (12.2) | ||
| Asian | 647 ( 1.2) | 234 ( 1.4) | ||
| Caucasian | 43997 (79.3) | 12106 (74.6) | ||
| Hispanic | 2279 ( 4.1) | 748 ( 4.6) | ||
| Native American | 419 ( 0.8) | 218 ( 1.3) | ||
| Other/Unknown | 2509 ( 4.5) | 953 ( 5.9) | ||
| wbc_apache (mean (SD)) | 12.11 (6.97) | 12.19 (6.72) | 0.243 |
We have to consider the NA values in our data. For simplicity, we will replace missing values with the median of the train_set.
Note that it is important that we do this after splitting into training and test sets. If we imputed values using test data, we would risk leaking information about our test set into our training set (“data leakage”).
Feel free to experiment with different imputation techniques!
predictors <- c("age", "bmi", "gender", "glucose_apache", "h1_glucose_max", "d1_glucose_max", "ethnicity", "wbc_apache")
for (col in predictors) {
test_set[[col]][is.na(test_set[[col]])] <-
median(train_set[[col]], na.rm = TRUE)
train_set[[col]][is.na(train_set[[col]])] <-
median(train_set[[col]], na.rm = TRUE)
}
For many models it is important to normalize the data. Without normalization, variables with larger magnitudes may have a greater effect on the model.
One common approach for normalization is to divide each variable by its standard deviation and subtract the mean. As before, to avoid data leakage, normalization should be done after splitting into training and test sets.
Our tree model does not require normalization, so we will skip this step.
randomForest library to build a random forest model.hist.data.frame(train_set)
for example: method = “svmPoly”, method = “glm”
# Define forest tuning parameters
# 5-fold cross validation
control <- trainControl(method = "repeatedcv", number = 2)
# Number of variables tried at each split
mtry <- sqrt(ncol(train_set))
# Grid search is a linear search through a vector of mtry values
tunegrid <- expand.grid(.mtry = mtry)
# Create classification forest using age, bmi, and gender
forest <- train(outcome_variable ~ age + bmi + gender + glucose_apache + h1_glucose_max + d1_glucose_max + ethnicity + wbc_apache,
data = train_set,
method = "rf",
metric = "Accuracy",
tuneGrid = tunegrid,
trControl = control)
caret variable importance Documentation# Calculate variable importance
importance <- varImp(forest)
kable(importance$importance)
| Overall | |
|---|---|
| age | 47.0295952 |
| bmi | 67.4601747 |
| gender | 5.8278661 |
| glucose_apache | 80.2409013 |
| h1_glucose_max | 44.6706070 |
| d1_glucose_max | 100.0000000 |
| ethnicityAsian | 0.1917897 |
| ethnicityCaucasian | 4.1155818 |
| ethnicityHispanic | 1.2941783 |
| ethnicityNative American | 0.0000000 |
| ethnicityOther/Unknown | 1.4721757 |
| wbc_apache | 50.3555599 |
We can see that glucose_apache and bmi had the most predictive power.
head(test_set, 5) %>% select(age, bmi, gender, glucose_apache, h1_glucose_max, d1_glucose_max, ethnicity, wbc_apache)
## # A tibble: 5 × 8
## age bmi gender glucose_apache h1_glucose_max d1_glucose_max ethnicity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 77 27.4 0 145 145 145 Caucasian
## 2 19 27.7 1 133 140 150 Caucasian
## 3 67 27.6 1 156 140 156 Caucasian
## 4 50 25.7 1 134 134 134 Caucasian
## 5 81 38.2 1 120 121 121 Caucasian
## # … with 1 more variable: wbc_apache <dbl>
original_test_set <- test_set
hist.data.frame(original_test_set)
allvars <- c("bmi", "age", "gender", "glucose_apache", "h1_glucose_max", "d1_glucose_max","ethnicity", "wbc_apache")
catvars <- c("gender")
table_gossis <- CreateTableOne(vars = allvars, data = original_test_set,
factorVars = catvars,
strata = "outcome_variable")
kableone(table_gossis, caption = "Summary statistics of the test set")
| 0 | 1 | p | test | |
|---|---|---|---|---|
| n | 14101 | 4098 | ||
| bmi (mean (SD)) | 28.37 (7.81) | 31.70 (8.82) | <0.001 | |
| age (mean (SD)) | 61.85 (16.81) | 65.08 (13.90) | <0.001 | |
| gender = 1 (%) | 7543 (53.5) | 2242 (54.7) | 0.175 | |
| glucose_apache (mean (SD)) | 140.80 (68.46) | 214.12 (111.57) | <0.001 | |
| h1_glucose_max (mean (SD)) | 143.03 (45.38) | 182.37 (97.99) | <0.001 | |
| d1_glucose_max (mean (SD)) | 154.69 (66.49) | 236.51 (105.41) | <0.001 | |
| ethnicity (%) | <0.001 | |||
| African American | 1388 ( 9.8) | 476 (11.6) | ||
| Asian | 181 ( 1.3) | 64 ( 1.6) | ||
| Caucasian | 11261 (79.9) | 3067 (74.8) | ||
| Hispanic | 542 ( 3.8) | 193 ( 4.7) | ||
| Native American | 98 ( 0.7) | 47 ( 1.1) | ||
| Other/Unknown | 631 ( 4.5) | 251 ( 6.1) | ||
| wbc_apache (mean (SD)) | 11.70 (6.10) | 11.93 (6.21) | 0.030 |
####Original results:
# Make predictions on test set
test_set <- original_test_set
forest_pred <- predict(forest,
newdata = test_set,
type = "raw")
# Combine unseen patients data with corresponding predictions
data.frame(age = test_set$age,
bmi = test_set$bmi,
gender = test_set$gender,
glucose_apache = test_set$glucose_apache,
h1_glucose_max = test_set$h1_glucose_max,
d1_glucose_max = test_set$d1_glucose_max,
ethnicity = test_set$ethnicity,
wbc_apache = test_set$wbc_apache,
prediction = forest_pred,
truth_value = test_set$outcome_variable) %>%
head(30) %>%
kable()
| age | bmi | gender | glucose_apache | h1_glucose_max | d1_glucose_max | ethnicity | wbc_apache | prediction | truth_value |
|---|---|---|---|---|---|---|---|---|---|
| 77 | 27.42000 | 0 | 145.0 | 145.000 | 145 | Caucasian | 12.7 | 0 | 1 |
| 19 | 27.65432 | 1 | 133.0 | 140.000 | 150 | Caucasian | 10.4 | 0 | 0 |
| 67 | 27.56000 | 1 | 156.0 | 140.000 | 156 | Caucasian | 10.9 | 0 | 1 |
| 50 | 25.71000 | 1 | 134.0 | 134.000 | 134 | Caucasian | 8.4 | 0 | 0 |
| 81 | 38.18907 | 1 | 120.0 | 121.000 | 121 | Caucasian | 6.9 | 0 | 0 |
| 79 | 23.40898 | 0 | 175.0 | 140.000 | 175 | Caucasian | 13.5 | 0 | 0 |
| 65 | 27.65432 | 0 | 205.0 | 140.000 | 208 | Caucasian | 10.1 | 0 | 1 |
| 60 | 26.48571 | 1 | 149.0 | 140.000 | 149 | African American | 7.5 | 0 | 0 |
| 71 | 34.30569 | 0 | 157.0 | 140.000 | 157 | Caucasian | 5.4 | 0 | 0 |
| 58 | 27.65432 | 0 | 88.0 | 140.000 | 88 | Caucasian | 6.7 | 0 | 0 |
| 85 | 21.88964 | 0 | 81.0 | 140.000 | 81 | Caucasian | 6.0 | 0 | 0 |
| 65 | 26.28642 | 1 | 361.0 | 140.000 | 361 | African American | 10.4 | 1 | 0 |
| 82 | 23.73812 | 1 | 121.0 | 140.000 | 121 | Caucasian | 8.5 | 0 | 0 |
| 64 | 16.98039 | 1 | 96.0 | 140.000 | 113 | Caucasian | 21.8 | 0 | 0 |
| 59 | 14.84493 | 0 | 71.0 | 140.000 | 182 | African American | 1.8 | 0 | 0 |
| 55 | 32.20536 | 1 | 81.0 | 95.000 | 95 | Caucasian | 8.2 | 0 | 0 |
| 62 | 27.26740 | 1 | 72.0 | 87.000 | 87 | Caucasian | 3.5 | 0 | 0 |
| 56 | 34.15440 | 0 | 133.0 | 140.000 | 150 | Caucasian | 10.4 | 0 | 0 |
| 73 | 20.35197 | 0 | 159.0 | 140.000 | 159 | Caucasian | 15.6 | 0 | 0 |
| 48 | 26.51648 | 1 | 96.0 | 140.000 | 96 | Caucasian | 5.6 | 0 | 0 |
| 65 | 33.00135 | 0 | 94.0 | 140.000 | 94 | Caucasian | 10.4 | 0 | 0 |
| 55 | 24.12382 | 1 | 114.0 | 140.000 | 114 | Caucasian | 4.1 | 0 | 0 |
| 71 | 35.07812 | 0 | 181.0 | 140.000 | 181 | Caucasian | 12.1 | 0 | 1 |
| 47 | 28.86719 | 0 | 76.0 | 82.000 | 121 | Asian | 4.7 | 0 | 0 |
| 75 | 27.65432 | 0 | 598.7 | 695.045 | 611 | Caucasian | 11.4 | 1 | 1 |
| 55 | 23.74531 | 0 | 160.0 | 192.000 | 192 | African American | 16.2 | 0 | 0 |
| 74 | 27.65432 | 0 | 324.0 | 140.000 | 324 | Caucasian | 18.0 | 1 | 1 |
| 68 | 18.45329 | 0 | 260.0 | 291.000 | 291 | Caucasian | 27.6 | 0 | 0 |
| 89 | 27.25875 | 1 | 106.0 | 140.000 | 106 | Caucasian | 15.4 | 0 | 0 |
| 69 | 34.75311 | 0 | 133.0 | 140.000 | 145 | Caucasian | 10.4 | 0 | 0 |
# Get the probabilities for our forest predictions
forest_probs <- predict(forest, newdata = test_set, type = "prob")
forest_probs <- forest_probs[, 2]
# Create a "prediction" object using our probabilities and the outcome variable
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(forest_perf, col = rainbow(10))
abline(a = 0, b = 1)
# Calculate the AUC
auc_forest <- performance(forest_roc, measure = "auc")
auc_forest <- auc_forest@y.values[[1]]
# Ground truth is recorded in the GOSSIS data
cm = confusionMatrix(forest_pred,
test_set$outcome_variable,
positive = "1")$tab
Reference <- factor(c(0, 0, 1, 1))
Prediction <- factor(c(0, 1, 0, 1))
Y <- c(cm[1], cm[2], cm[3], cm[4])
df_2 <- data.frame(Reference, Prediction, Y)
ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) + ggtitle("Overall prediction")+
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_gradient(low="white", high="#009194") +
theme_bw() + theme(legend.position = "none")+ coord_fixed()
accuracy <- (cm[1] + cm[4])/(cm[1] + cm[2] + cm[3] + cm[4])
recall = cm[4]/ (cm[4] + cm[2])
specificity = cm[1]/ (cm[1] + cm[3])
print("auc:")
## [1] "auc:"
print(auc_forest)
## [1] 0.8130455
print("accuracy:")
## [1] "accuracy:"
print(accuracy)
## [1] 0.8068026
print("recall:")
## [1] "recall:"
print(recall)
## [1] 0.612442
print("specificity:")
## [1] "specificity:"
print(specificity)
## [1] 0.8390238
df_result <- data.frame(name = character(), auc = double(), accuracy = double(), recall = double(), specificity = double(), range = character())
library(ggplot2)
var_list <- c("gender", "ethnicity")
for (var in var_list)
{
print(var)
list <- unique(original_test_set[[var]])
for (i in list)
{
test_set <- original_test_set[original_test_set[[var]] == i, ]
num = dim(test_set)[1]
# Make predictions on test set
forest_pred <- predict(forest,
newdata = test_set,
type = "raw")
data.frame(age = test_set$age,
bmi = test_set$bmi,
gender = test_set$gender,
glucose_apache = test_set$glucose_apache,
h1_glucose_max = test_set$h1_glucose_max,
d1_glucose_max = test_set$d1_glucose_max,
ethnicity = test_set$ethnicity,
wbc_apache = test_set$wbc_apache,
prediction = forest_pred,
truth_value = test_set$outcome_variable)
# Get the probabilities for our forest predictions
forest_probs <- predict(forest, newdata = test_set, type = "prob")
forest_probs <- forest_probs[, 2]
# Create a "prediction" object using our probabilities and the outcome variable
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(forest_perf, col = rainbow(10), main= paste(var,":",i,"(",num,")"))
abline(a = 0, b = 1)
# Calculate the AUC
auc_forest <- performance(forest_roc, measure = "auc")
auc_forest <- auc_forest@y.values[[1]]
# Ground truth is recorded in the GOSSIS data
cm = confusionMatrix(forest_pred,
test_set$outcome_variable,
positive = "1")$tab
Reference <- factor(c(0, 0, 1, 1))
Prediction <- factor(c(0, 1, 0, 1))
Y <- c(cm[1], cm[2], cm[3], cm[4])
df_2 <- data.frame(Reference, Prediction, Y)
print(ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) + ggtitle(paste(var,":",i,"(",num,")")) +
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_gradient(low="white", high="#009194") +
theme_bw() + theme(legend.position = "none")+ coord_fixed())
accuracy <- (cm[1] + cm[4])/(cm[1] + cm[2] + cm[3] + cm[4])
recall = cm[4]/ (cm[4] + cm[2])
specificity = cm[1]/ (cm[1] + cm[3])
df_result <- rbind(df_result, data.frame(name = var, auc = auc_forest,accuracy = accuracy,recall = recall,specificity = specificity, range = i))
print(paste(i,"(", num, "):"))
print("auc:")
print(auc_forest)
print("accuracy:")
print(accuracy)
print("recall:")
print(recall)
print("specificity:")
print(specificity)
}
}
## [1] "gender"
## [1] "0 ( 8414 ):"
## [1] "auc:"
## [1] 0.8209621
## [1] "accuracy:"
## [1] 0.8168529
## [1] "recall:"
## [1] 0.6338148
## [1] "specificity:"
## [1] 0.8466215
## [1] "1 ( 9785 ):"
## [1] "auc:"
## [1] 0.8059802
## [1] "accuracy:"
## [1] 0.7985692
## [1] "recall:"
## [1] 0.5961675
## [1] "specificity:"
## [1] 0.832617
## [1] "ethnicity"
## [1] "Caucasian ( 14328 ):"
## [1] "auc:"
## [1] 0.8179748
## [1] "accuracy:"
## [1] 0.8161642
## [1] "recall:"
## [1] 0.6157135
## [1] "specificity:"
## [1] 0.8462712
## [1] "African American ( 1864 ):"
## [1] "auc:"
## [1] 0.777936
## [1] "accuracy:"
## [1] 0.7757511
## [1] "recall:"
## [1] 0.5917722
## [1] "specificity:"
## [1] 0.8133075
## [1] "Asian ( 245 ):"
## [1] "auc:"
## [1] 0.7922566
## [1] "accuracy:"
## [1] 0.7632653
## [1] "recall:"
## [1] 0.5833333
## [1] "specificity:"
## [1] 0.7942584
## [1] "Hispanic ( 735 ):"
## [1] "auc:"
## [1] 0.814394
## [1] "accuracy:"
## [1] 0.7768707
## [1] "recall:"
## [1] 0.5986395
## [1] "specificity:"
## [1] 0.8214286
## [1] "Native American ( 145 ):"
## [1] "auc:"
## [1] 0.8129614
## [1] "accuracy:"
## [1] 0.737931
## [1] "recall:"
## [1] 0.6285714
## [1] "specificity:"
## [1] 0.7727273
## [1] "Other/Unknown ( 882 ):"
## [1] "auc:"
## [1] 0.8015829
## [1] "accuracy:"
## [1] 0.7721088
## [1] "recall:"
## [1] 0.6373626
## [1] "specificity:"
## [1] 0.8071429
df <- data.frame(name = "age", range = c(20, 40, 60, 80))
df <- rbind(df, data.frame(name = "bmi", range = c(20, 30, 40)))
df <- rbind(df, data.frame(name = "glucose_apache", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "h1_glucose_max", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "d1_glucose_max", range = c(100, 200, 300)))
df <- rbind(df, data.frame(name = "wbc_apache", range = c(5, 10, 15, 20)))
var_list <- c("age","bmi","glucose_apache", "h1_glucose_max", "d1_glucose_max", "wbc_apache")
for (var in var_list)
{
print(var)
var_range = df[df$name==var,"range"]
list <- unique(original_test_set[[var]])
for (i in 0:length(var_range)+1)
{
if(i==1){
test_set <- original_test_set[original_test_set[[var]] < var_range[i], ]
}
else if(i==length(var_range)+1){
test_set <- original_test_set[original_test_set[[var]] > var_range[i-1], ]
}
else{
test_set <- original_test_set[original_test_set[[var]] > var_range[i-1] & original_test_set[[var]] < var_range[i], ]
}
num = dim(test_set)[1]
if(i==1){
range_text = paste("[", i, "]", "<", var_range[i], "(",num, ")")
}
else if(i==length(var_range)+1){
range_text = paste("[", i, "]", ">", var_range[i-1], "(",num, ")")
}
else{
range_text = paste("[", i, "]", var_range[i-1],"~", var_range[i], "(",num, ")")
}
# Make predictions on test set
forest_pred <- predict(forest,
newdata = test_set,
type = "raw")
data.frame(age = test_set$age,
bmi = test_set$bmi,
gender = test_set$gender,
glucose_apache = test_set$glucose_apache,
h1_glucose_max = test_set$h1_glucose_max,
d1_glucose_max = test_set$d1_glucose_max,
ethnicity = test_set$ethnicity,
wbc_apache = test_set$wbc_apache,
prediction = forest_pred,
truth_value = test_set$outcome_variable)
# Get the probabilities for our forest predictions
forest_probs <- predict(forest, newdata = test_set, type = "prob")
forest_probs <- forest_probs[, 2]
# Create a "prediction" object using our probabilities and the outcome variable
forest_roc <- prediction(forest_probs, test_set$outcome_variable)
forest_perf <- performance(forest_roc, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(forest_perf, col = rainbow(10), main= paste(var,":",range_text))
abline(a = 0, b = 1)
# Calculate the AUC
auc_forest <- performance(forest_roc, measure = "auc")
auc_forest <- auc_forest@y.values[[1]]
print(paste(range_text,":"))
# Ground truth is recorded in the GOSSIS data
cm = confusionMatrix(forest_pred,
test_set$outcome_variable,
positive = "1")$tab
Reference <- factor(c(0, 0, 1, 1))
Prediction <- factor(c(0, 1, 0, 1))
Y <- c(cm[1], cm[2], cm[3], cm[4])
df_2 <- data.frame(Reference, Prediction, Y)
print(ggplot(data = df_2, mapping = aes(x = Reference, y = Prediction)) + ggtitle(range_text) +
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = .5, fontface = "bold", alpha = 1) +
scale_fill_gradient(low="white", high="#009194") +
theme_bw() + theme(legend.position = "none")+ coord_fixed())
accuracy <- (cm[1] + cm[4])/(cm[1] + cm[2] + cm[3] + cm[4])
recall = cm[4]/ (cm[4] + cm[2])
specificity = cm[1]/ (cm[1] + cm[3])
df_result <- rbind(df_result, data.frame(name = var, auc = auc_forest,accuracy = accuracy,recall = recall,specificity = specificity, range = range_text))
print("auc:")
print(auc_forest)
print("accuracy:")
print(accuracy)
print("recall:")
print(recall)
print("specificity:")
print(specificity)
}
}
## [1] "age"
## [1] "[ 1 ] < 20 ( 160 ) :"
## [1] "auc:"
## [1] 0.9911937
## [1] "accuracy:"
## [1] 0.9625
## [1] "recall:"
## [1] 0.7
## [1] "specificity:"
## [1] 1
## [1] "[ 2 ] 20 ~ 40 ( 1653 ) :"
## [1] "auc:"
## [1] 0.9303992
## [1] "accuracy:"
## [1] 0.9122807
## [1] "recall:"
## [1] 0.6790123
## [1] "specificity:"
## [1] 0.9376258
## [1] "[ 3 ] 40 ~ 60 ( 4691 ) :"
## [1] "auc:"
## [1] 0.8298128
## [1] "accuracy:"
## [1] 0.8166702
## [1] "recall:"
## [1] 0.5799419
## [1] "specificity:"
## [1] 0.857357
## [1] "[ 4 ] 60 ~ 80 ( 8360 ) :"
## [1] "auc:"
## [1] 0.7820416
## [1] "accuracy:"
## [1] 0.7767943
## [1] "recall:"
## [1] 0.6246334
## [1] "specificity:"
## [1] 0.8064608
## [1] "[ 5 ] > 80 ( 2420 ) :"
## [1] "auc:"
## [1] 0.773025
## [1] "accuracy:"
## [1] 0.8144628
## [1] "recall:"
## [1] 0.6103896
## [1] "specificity:"
## [1] 0.8359982
## [1] "bmi"
## [1] "[ 1 ] < 20 ( 1426 ) :"
## [1] "auc:"
## [1] 0.834415
## [1] "accuracy:"
## [1] 0.8870968
## [1] "recall:"
## [1] 0.5042735
## [1] "specificity:"
## [1] 0.921314
## [1] "[ 2 ] 20 ~ 30 ( 10204 ) :"
## [1] "auc:"
## [1] 0.815502
## [1] "accuracy:"
## [1] 0.8360447
## [1] "recall:"
## [1] 0.6160991
## [1] "specificity:"
## [1] 0.8591229
## [1] "[ 3 ] 30 ~ 40 ( 4930 ) :"
## [1] "auc:"
## [1] 0.7814041
## [1] "accuracy:"
## [1] 0.7561866
## [1] "recall:"
## [1] 0.6107249
## [1] "specificity:"
## [1] 0.7935254
## [1] "[ 4 ] > 40 ( 1636 ) :"
## [1] "auc:"
## [1] 0.7548043
## [1] "accuracy:"
## [1] 0.7114914
## [1] "recall:"
## [1] 0.6424242
## [1] "specificity:"
## [1] 0.7414549
## [1] "glucose_apache"
## [1] "[ 1 ] < 100 ( 4572 ) :"
## [1] "auc:"
## [1] 0.7937578
## [1] "accuracy:"
## [1] 0.861986
## [1] "recall:"
## [1] 0.6477273
## [1] "specificity:"
## [1] 0.8661909
## [1] "[ 2 ] 100 ~ 200 ( 9650 ) :"
## [1] "auc:"
## [1] 0.753888
## [1] "accuracy:"
## [1] 0.8596891
## [1] "recall:"
## [1] 0.5737705
## [1] "specificity:"
## [1] 0.8671061
## [1] "[ 3 ] 200 ~ 300 ( 2503 ) :"
## [1] "auc:"
## [1] 0.6470318
## [1] "accuracy:"
## [1] 0.6048742
## [1] "recall:"
## [1] 0.619211
## [1] "specificity:"
## [1] 0.592371
## [1] "[ 4 ] > 300 ( 1269 ) :"
## [1] "auc:"
## [1] 0.5876709
## [1] "accuracy:"
## [1] 0.6052009
## [1] "recall:"
## [1] 0.6152399
## [1] "specificity:"
## [1] 0.5533981
## [1] "h1_glucose_max"
## [1] "[ 1 ] < 100 ( 1193 ) :"
## [1] "auc:"
## [1] 0.8047401
## [1] "accuracy:"
## [1] 0.8365465
## [1] "recall:"
## [1] 0.6027397
## [1] "specificity:"
## [1] 0.8517857
## [1] "[ 2 ] 100 ~ 200 ( 15255 ) :"
## [1] "auc:"
## [1] 0.7951695
## [1] "accuracy:"
## [1] 0.8254998
## [1] "recall:"
## [1] 0.582517
## [1] "specificity:"
## [1] 0.8486502
## [1] "[ 3 ] 200 ~ 300 ( 1029 ) :"
## [1] "auc:"
## [1] 0.626052
## [1] "accuracy:"
## [1] 0.5986395
## [1] "recall:"
## [1] 0.6352941
## [1] "specificity:"
## [1] 0.5483871
## [1] "[ 4 ] > 300 ( 636 ) :"
## [1] "auc:"
## [1] 0.5722972
## [1] "accuracy:"
## [1] 0.6509434
## [1] "recall:"
## [1] 0.6672504
## [1] "specificity:"
## [1] 0.5076923
## [1] "d1_glucose_max"
## [1] "[ 1 ] < 100 ( 1889 ) :"
## [1] "auc:"
## [1] 0.7760518
## [1] "accuracy:"
## [1] 0.9534145
## [1] "recall:"
## [1] 1
## [1] "specificity:"
## [1] 0.9533898
## [1] "[ 2 ] 100 ~ 200 ( 11849 ) :"
## [1] "auc:"
## [1] 0.7331695
## [1] "accuracy:"
## [1] 0.8575407
## [1] "recall:"
## [1] 0.5461538
## [1] "specificity:"
## [1] 0.860995
## [1] "[ 3 ] 200 ~ 300 ( 2830 ) :"
## [1] "auc:"
## [1] 0.6437554
## [1] "accuracy:"
## [1] 0.5968198
## [1] "recall:"
## [1] 0.609736
## [1] "specificity:"
## [1] 0.5871446
## [1] "[ 4 ] > 300 ( 1456 ) :"
## [1] "auc:"
## [1] 0.5759749
## [1] "accuracy:"
## [1] 0.6112637
## [1] "recall:"
## [1] 0.625817
## [1] "specificity:"
## [1] 0.5344828
## [1] "wbc_apache"
## [1] "[ 1 ] < 5 ( 994 ) :"
## [1] "auc:"
## [1] 0.7947097
## [1] "accuracy:"
## [1] 0.8138833
## [1] "recall:"
## [1] 0.6168224
## [1] "specificity:"
## [1] 0.837655
## [1] "[ 2 ] 5 ~ 10 ( 5506 ) :"
## [1] "auc:"
## [1] 0.8156433
## [1] "accuracy:"
## [1] 0.806393
## [1] "recall:"
## [1] 0.6215236
## [1] "specificity:"
## [1] 0.8390682
## [1] "[ 3 ] 10 ~ 15 ( 7902 ) :"
## [1] "auc:"
## [1] 0.8221807
## [1] "accuracy:"
## [1] 0.8171349
## [1] "recall:"
## [1] 0.6213848
## [1] "specificity:"
## [1] 0.8501701
## [1] "[ 4 ] 15 ~ 20 ( 1960 ) :"
## [1] "auc:"
## [1] 0.792613
## [1] "accuracy:"
## [1] 0.7897959
## [1] "recall:"
## [1] 0.57
## [1] "specificity:"
## [1] 0.8295181
## [1] "[ 5 ] > 20 ( 1602 ) :"
## [1] "auc:"
## [1] 0.794275
## [1] "accuracy:"
## [1] 0.7734082
## [1] "recall:"
## [1] 0.5804598
## [1] "specificity:"
## [1] 0.7969188
var_list <- c("gender","ethnicity","age","bmi","glucose_apache", "h1_glucose_max", "d1_glucose_max", "wbc_apache")
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, auc)) +
xlab(var) + ylab("AUC") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$auc)-min(tmp$auc))
}
## [1] "Gap:"
## [1] 0.01498192
## [1] "Gap:"
## [1] 0.04003881
## [1] "Gap:"
## [1] 0.2181687
## [1] "Gap:"
## [1] 0.07961078
## [1] "Gap:"
## [1] 0.2060869
## [1] "Gap:"
## [1] 0.2324429
## [1] "Gap:"
## [1] 0.2000769
## [1] "Gap:"
## [1] 0.02956778
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, accuracy)) +
xlab(var) + ylab("Accuracy") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$accuracy)-min(tmp$accuracy))
}
## [1] "Gap:"
## [1] 0.01828363
## [1] "Gap:"
## [1] 0.07823312
## [1] "Gap:"
## [1] 0.1857057
## [1] "Gap:"
## [1] 0.1756053
## [1] "Gap:"
## [1] 0.2571119
## [1] "Gap:"
## [1] 0.2379071
## [1] "Gap:"
## [1] 0.3565947
## [1] "Gap:"
## [1] 0.04372666
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, recall)) +
xlab(var) + ylab("Recall") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$recall)-min(tmp$recall))
}
## [1] "Gap:"
## [1] 0.03764729
## [1] "Gap:"
## [1] 0.0540293
## [1] "Gap:"
## [1] 0.1200581
## [1] "Gap:"
## [1] 0.1381507
## [1] "Gap:"
## [1] 0.07395678
## [1] "Gap:"
## [1] 0.08473348
## [1] "Gap:"
## [1] 0.4538462
## [1] "Gap:"
## [1] 0.05152358
for (var in var_list)
{
tmp <- df_result %>% filter(name == var)
print(ggplot(tmp, aes(range, specificity)) +
xlab(var) + ylab("Specificity") +
geom_bar(stat = "identity"))
print("Gap:")
print(max(tmp$specificity)-min(tmp$specificity))
}
## [1] "Gap:"
## [1] 0.01400453
## [1] "Gap:"
## [1] 0.0735439
## [1] "Gap:"
## [1] 0.1935392
## [1] "Gap:"
## [1] 0.1798591
## [1] "Gap:"
## [1] 0.313708
## [1] "Gap:"
## [1] 0.3440934
## [1] "Gap:"
## [1] 0.4189071
## [1] "Gap:"
## [1] 0.05325133